Final: Effects of Air Pollution on Countries

1. Introduction

Motivation

Climate issues are pervasive but typically disproportionately affect low income communities and developing countries. Our group wanted to explore how air pollution has changed over time and affect countries differently. Specifically, we wanted to analyze how a country’s economic and social position can either increase, decrease, or not have observable impact on the affects of air pollution. In laymen terms, does air pollution affect underdeveloped countries disproportionately?

Set Up

Before we start, we need to ensure that we have all the relevant libraries installed and imported.

Run these in the console, or only the ones that your system does not have, to install packages in addition to the ezids package.

install.packages("tidyverse")
install.packages("rworldmap")
install.packages("tmap")
install.packages("spData")
install.packages("sf")
install.packages("ggpubr")
install.packages("dplyr")
install.packages("knitr")
install.packages("magrittr")
install.packages("factoextra")
install.packages("cluster")
install.packages("caret")

2. Data Sources and Data Wrangling

Data Sources

For our analysis, we will be working with 5 main data sources shown in the table below:

Figure 1: Data Sources
Data Source Link
Deaths Due to Air Pollution of Countries from 1990 - 2017 Kaggle Link
GDP Annual Growth of Countries from 1960 - 2020 Kaggle via WorldBank Link
United Nations Population and Region Data United Nations Link
United Nations ISO-alpha3 code United Nations Link
spData for Map Geometries spData for Mapping Link

The main variables in our datasets will include:

Figure 2: Key Variables
Feature Data Type Unit of Measure Notes and Assumptions
GDP (Gross Domestic Product) Numerical, Continuous $USD This is our chosen proxy for measuring a country’s economic status
Population Size Numerical, Continuous thousands of people Annual UN estimated
Deaths due to Air Pollution Numerical, Continuous deaths per million This is our chosen proxy for measuring the negative affects of air pollution.
Country Qualitative, Categorical N/A 231 countries
SDG Region Qualitative, Categorical N/A UN’s Sustainable Development Goals Region Classification.
Sub Region Qualitative, Categorical N/A UN’s Sustainable Development Goals Sub-Region Classification.
ISO-alpha3 Country Code Qualitative, Categorical N/A Standard for identifying countries (text ID).
ISO-alpha2 Country Code Qualitative, Categorical N/A Another standard for identifying countries (text ID).
M49 Country Code Numerical, Categorical N/A Another standard for identifying countries (numerical ID).
Year Numerical, Categorical N/A 1990 to 2017
GDP per Capita Numerical, Continuous $USD per person Normalization of GDP to compare between population sizes (calculated).

Data Wrangling

While data from Kaggle are already in a format to be cleaned, downloaded data from United Nations required a little data wrangling. Mainly, we needed to extract just countries’ data from the Excel workbooks and into their own contained csv files. Since we only need to do this once and programming it would take significant time to choose the specific cells that we need, we opted to perform this step outside of R and in Excel. Note that if this were a part of a real production data pipeline, we would take the time to program the data extraction but would likely choose a different programming language such as Python that is a bit more robust in these types of tasks like web scraping and data transformations in Pandas.

UN Data Sample Messy
Figure 3: Sample screenshot of data downloaded from UN including unnecessary elements like banners and other regional data.
UN Data Sample Cleaned
Figure 4: Sample screenshot of transformed UN dataset.

3. Load, Clean, and Inspect Data

Load Data

Figure 5: Structure of country_codes_df
variable class first_values
Country.or.Area character Andorra, United Arab Emirates (the), Afghanistan, Antigua and Barbuda, Anguilla, Albania
ISO.alpha2.code character AD, AE, AF, AG, AI, AL
ISO.alpha3.code character AND, ARE, AFG, ATG, AIA, ALB
M49.code integer 20, 784, 4, 28, 660, 8
Figure 6: Structure of air_pollution_df
variable class first_values
Entity character Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghanistan
Code character AFG, AFG, AFG, AFG, AFG, AFG
Year integer 1990, 1991, 1992, 1993, 1994, 1995
Air.pollution..total…deaths.per.100.000. double 299.477308883281, 291.277966734046, 278.963055615066, 278.790814746341, 287.162923177255, 288.01422374243
Indoor.air.pollution..deaths.per.100.000. double 250.362909742375, 242.575124973334, 232.043877894811, 231.648133503794, 238.837176822107, 239.906598716878
Outdoor.particulate.matter..deaths.per.100.000. double 46.4465894382846, 46.0338405670284, 44.2437660321924, 44.4401481443785, 45.5943284100213, 45.3671411300974
Outdoor.ozone.pollution..deaths.per.100.000. double 5.61644203074918, 5.60396011603667, 5.61182206482564, 5.65526606275628, 5.71892222061506, 5.73917378233707
Figure 7: Structure of gdp_df, not showing all years in table to retain space but years go up to X2020
variable class first_values
Country.Name character Aruba, Afghanistan, Angola, Albania, Andorra, Arab World
Country.Code character ABW, AFG, AGO, ALB, AND, ARB
Indicator.Name character GDP (current US\(), GDP (current US\)), GDP (current US\(), GDP (current US\)), GDP (current US\(), GDP (current US\))
Indicator.Code character NY.GDP.MKTP.CD, NY.GDP.MKTP.CD, NY.GDP.MKTP.CD, NY.GDP.MKTP.CD, NY.GDP.MKTP.CD, NY.GDP.MKTP.CD
X1960 double NA, 537777811.111111, NA, NA, NA, NA
X1961 double NA, 548888895.555556, NA, NA, NA, NA
X1962 double NA, 546666677.777778, NA, NA, NA, NA
X1963 double NA, 751111191.111111, NA, NA, NA, NA
X1964 double NA, 800000044.444444, NA, NA, NA, NA
X1965 double NA, 1006666637.77778, NA, NA, NA, NA
Figure 8: Structure of population_region_df, not showing all years in table to retain space but years go up to X2020
variable class first_values
SDGRegion character SUB-SAHARAN AFRICA, SUB-SAHARAN AFRICA, SUB-SAHARAN AFRICA, SUB-SAHARAN AFRICA, SUB-SAHARAN AFRICA, SUB-SAHARAN AFRICA
SubRegion character Eastern Africa, Eastern Africa, Eastern Africa, Eastern Africa, Eastern Africa, Eastern Africa
Country character Burundi, Comoros, Djibouti, Eritrea, Ethiopia, Kenya
Notes integer NA, NA, NA, NA, NA, NA
Country.code integer 108, 174, 262, 232, 231, 404
Type character Country/Area, Country/Area, Country/Area, Country/Area, Country/Area, Country/Area
Parent.code integer 910, 910, 910, 910, 910, 910
X1950 character 2 309, 159, 62, 822, 18 128, 6 077
X1951 character 2 360, 163, 63, 835, 18 467, 6 242
X1952 character 2 406, 167, 65, 849, 18 820, 6 416
Figure 9: Structure of world, not showing geom feature in table as it has unique list of values per row and therefore is extremely large to display.
variable class first_values
iso_a2 character FJ, TZ, EH, CA, US, KZ
name_long character Fiji, Tanzania, Western Sahara, Canada, United States, Kazakhstan
continent character Oceania, Africa, Africa, North America, North America, Asia
region_un character Oceania, Africa, Africa, Americas, Americas, Asia
subregion character Melanesia, Eastern Africa, Northern Africa, Northern America, Northern America, Central Asia
type character Sovereign country, Sovereign country, Indeterminate, Sovereign country, Country, Sovereign country
area_km2 double 19289.9707329765, 932745.792357074, 96270.6010408472, 10036042.9767873, 9510743.74482458, 2729810.51298781
pop double 885806, 52234869, NA, 35535348, 318622525, 17288285
lifeExp double 69.96, 64.163, NA, 81.9530487804878, 78.8414634146341, 71.62
gdpPercap double 8222.25378436842, 2402.09940362843, NA, 43079.1425247165, 51921.9846391384, 23587.3375151466

Clean Data

First thing that we need to drop unnecessary columns and set datatypes (factor, num, etc.).

Clean air_pollution_df:

Figure 10: Structure of air_pollution_df_cleaned
variable class first_values
Country integer Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghanistan
ISO.alpha3.code integer AFG, AFG, AFG, AFG, AFG, AFG
Year integer 1990, 1991, 1992, 1993, 1994, 1995
Deaths.Air.Pollution.per.100k double 299.477308883281, 291.277966734046, 278.963055615066, 278.790814746341, 287.162923177255, 288.01422374243

Clean gdp_df:

Figure 11: Structure of gdp_df_cleaned
variable class first_values
Country integer Aruba, Aruba, Aruba, Aruba, Aruba, Aruba
ISO.alpha3.code integer ABW, ABW, ABW, ABW, ABW, ABW
Year integer 1986, 1987, 1988, 1989, 1990, 1991
GDP.USD double 405463417.11746, 487602457.746416, 596423607.114715, 695304363.031101, 764887117.194486, 872138715.083799

Clean population_region_df:

Figure 12: Structure of population_region_df_cleaned
variable class first_values
SDGRegion integer SUB-SAHARANAFRICA, SUB-SAHARANAFRICA, SUB-SAHARANAFRICA, SUB-SAHARANAFRICA, SUB-SAHARANAFRICA, SUB-SAHARANAFRICA
SubRegion integer EasternAfrica, EasternAfrica, EasternAfrica, EasternAfrica, EasternAfrica, EasternAfrica
Country integer Burundi, Burundi, Burundi, Burundi, Burundi, Burundi
M49.code integer 108, 108, 108, 108, 108, 108
Year integer 1950, 1951, 1952, 1953, 1954, 1955
Population.thousands double 2309, 2360, 2406, 2449, 2492, 2537

Clean population_region_df:

Figure 13: Structure of country_codes_df
variable class first_values
Country.or.Area integer Andorra, United Arab Emirates (the), Afghanistan, Antigua and Barbuda, Anguilla, Albania
ISO.alpha2.code integer AD, AE, AF, AG, AI, AL
ISO.alpha3.code integer AND, ARE, AFG, ATG, AIA, ALB
M49.code integer 20, 784, 4, 28, 660, 8

Clean world:

Figure 14: Structure of world_df_cleaned, not showing geom feature in table as it has unique list of values per row and therefore is extremely large to display.
variable class first_values
iso_a2 integer FJ, TZ, EH, CA, US, KZ

Note that we only have geometries for 175 countries, some will not be able to be plot on a map but that is okay.

Final DataFrame Construction

Now let’s merge our 4 datasets into one using a series of inner joins using country code and year as keys depending on the specific join. We are using inner joins because we want to drop all null values which would mean either a country does not have a country code or we have more years of data than our smallest year range (the air pollution dataset).

Figure 15: Structure of final_df
variable class first_values
ISO.alpha2.code integer AD, AD, AD, AD, AD, AD
M49.code integer 20, 20, 20, 20, 20, 20
Year integer 2012, 2013, 1990, 1991, 1992, 1993
ISO.alpha3.code integer AND, AND, AND, AND, AND, AND
Country.x integer Andorra, Andorra, Andorra, Andorra, Andorra, Andorra
Deaths.Air.Pollution.per.100k double 17.6754871826169, 17.1893417774086, 29.0238806202567, 28.6956788863825, 28.4603211317312, 27.8408717612189
GDP.USD double 3188808942.56713, 3193704343.20627, 1029048481.88051, 1106928582.86629, 1210013651.87713, 1007025755.00065
SDGRegion integer EUROPE, EUROPE, EUROPE, EUROPE, EUROPE, EUROPE
SubRegion integer SouthernEurope, SouthernEurope, SouthernEurope, SouthernEurope, SouthernEurope, SouthernEurope
Population.thousands double 82, 81, 55, 57, 59, 61
geom list list(), list(), list(), list(), list(), list()
gdp.per.capita double 38887.9139337455, 39428.4486815589, 18709.9723978275, 19419.7996994086, 20508.7059640192, 16508.6189344369

Our dataset is finally ready to be analyzed.

4. EDA Summary

All necessary EDA was performed in the Midterm Report where we looked at distributions for each key numerical variable in Air Pollution Induced Deaths per 100k, Population, GDP per Capita. We also looked into boxplots of these key numerical variables by Regions and Sub Regions. Interestingly, we observed that the same subregions that have low deaths caused by air pollution also have high GDP per capita comparatively.

Next, we checked for outliers and found that Deaths.Air.Pollution.per.100k definitely do not need to have outliers removed as it only represents 0.7% of the dataset. gdp.per.capita have higher percentage of classified outliers at 13.5%, however, we felt that keeping the extreme datapoints of this feature was important in our research and analysis to capture the true disproportionate distribution of wealth and progress across countries in our world.

Lastly, we explored a few choropleth maps to visually understand our dataset better.

We intentionally did not include any figures from EDA in this final report to preserve space. Please look at the Midterm Report for details on our entire EDA process.

5. Main Midterm Research Question

Below is a rehash of our main research question from the Midterm report and we felt that the findings from this model was important enough to warrant inclusion in this final report as well since some of our new SMART questions are building on the work from this analysis.

Do lower GDP countries have more deaths per 100k due to air pollution?

Is there a correlation between GDP per capita and deaths caused by pollution? Is it linear? How strong is the correlation?

Linear Fit

Let’s first look at the general fit on the overall data.

Figure 15: Linear model (fit1) on overall data, deaths due to air pollution per 100k vs GDP per capita, 1990 to 2017.

From the plot, we observed that there is indeed a negative correlation between deaths due to air pollution per 100k and GDP per capita. However, the strength of that relationship is not particularly strong as the R2 is really low at 0.295. This means that only 29% of the variance experienced in deaths due to air pollution per 100k is caused by GDP per capita in a linear relationship.

Transformed Log Scale - Linear Fit

We then performed a log transformation and found that our linear regression fits much better.

fit2’s summary statistics are:

## 
## Call:
## lm(formula = log(Deaths.Air.Pollution.per.100k) ~ log(gdp.per.capita), 
##     data = final_df_sf)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.099 -0.235  0.000  0.206  1.431 
## 
## Coefficients:
##                     Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)          7.38778    0.02663     277 <0.0000000000000002 ***
## log(gdp.per.capita) -0.38952    0.00323    -121 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.369 on 5195 degrees of freedom
## Multiple R-squared:  0.737,  Adjusted R-squared:  0.737 
## F-statistic: 1.45e+04 on 1 and 5195 DF,  p-value: <0.0000000000000002

We re-plotted the linear model and here were the results.

Figure 16: Fitting to a log(y) = (m)(log(x)) + b curve yields much stronger relationship.

Across the board, the strength of our linear relationship increases dramatically when first transforming both features by the log() function first. The new R2 is now 0.737 which means around 74% of the variance in our target feature can be explained by this mathematical relationship.

We can then predict a country’s deaths caused from air pollution in a given year by using the country’s GDP per capita with the following equation:

\[ log(Deaths_{from~air~pollution|per~year|per~country} / 100,000) = 7.38778 - 0.38952 * log(GDP_{per capita}) ~~~~~~~~~~~~~~~~ eqn (1) \]

or solving for our target variable:

\[ Deaths_{from~air~pollution|per~year|per~country} = 10^{7.38778 - 0.38952 * log(GDP per capita)} * 100,000 ~~~~~~~~~~~~~~~~ eqn (2) \]

Is there a difference in means of death caused by pollution between low, mid, and high GDP per capita?

We all know that correlation does not necessarily mean causation. Let us dig a little deeper and test if means of deaths caused by air pollution per 100k across different GDP per capita levels are equal or not.

One-Way ANOVA Test

We started off by performing a One-Way ANOVA test to determine if the means of deaths caused by air pollution per 100k across different GDP per capita levels are equal or not.

H0: \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_low_gdp = \(\mu\)deaths_medium_gdp = \(\mu\)deaths_high_gdp

H1: At least one of \(\mu\)deaths_lowest_gdp, \(\mu\)deaths_low_gdp, \(\mu\)deaths_medium_gdp, \(\mu\)deaths_high_gdp is not equal

We will use an \(\alpha\) value of 0.05.

## $`1`
## [1]  95.2 920.3
## 
## $`2`
## [1]  921 3100
## 
## $`3`
## [1]  3104 11493
## 
## $`4`
## [1]  11497 119106
##               Df   Sum Sq Mean Sq F value              Pr(>F)    
## qnt            3 10735714 3578571    3133 <0.0000000000000002 ***
## Residuals   5193  5932162    1142                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-valuetest1 is 0e+00, which is lower than \(\alpha\)0.05. Therefore, we reject our null hypothesis that \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_low_gdp = \(\mu\)deaths_medium_gdp = \(\mu\)deaths_high_gdp. This means that there is statistically significant that at least one of the means of deaths in low, medium, and high GDP per capita are not the same.

2-Sample T-Tests

We will conduct 6 2-sample t-tests to determine if each of the groupings are different from each other:

  • Lowest GDP per capita’s deaths does not equal Low GDP per capita’s deaths
    • H0: \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_low_gdp
    • H1: \(\mu\)deaths_lowest_gdp != \(\mu\)deaths_low_gdp
  • Low GDP per capita’s deaths does not equal Medium GDP per capita’s deaths
    • H0: \(\mu\)deaths_low_gdp = \(\mu\)deaths_medium_gdp
    • H1: \(\mu\)deaths_low_gdp != \(\mu\)deaths_medium_gdp
  • Medium GDP per capita’s deaths does not equal High GDP per capita’s deaths
    • H0: \(\mu\)deaths_medium_gdp = \(\mu\)deaths_high_gdp
    • H1: \(\mu\)deaths_medium_gdp != \(\mu\)deaths_high_gdp
  • Lowest GDP per capita’s deaths does not equal High GDP per capita’s deaths
    • H0: \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_high_gdp
    • H1: \(\mu\)deaths_lowest_gdp != \(\mu\)deaths_high_gdp
  • Lowest GDP per capita’s deaths does not equal Medium GDP per capita’s deaths
    • H0: \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_medium_gdp
    • H1: \(\mu\)deaths_lowest_gdp != \(\mu\)deaths_medium_gdp
  • Low GDP per capita’s deaths does not equal Highest GDP per capita’s deaths
    • H0: \(\mu\)deaths_low_gdp = \(\mu\)deaths_high_gdp
    • H1: \(\mu\)deaths_low_gdp != \(\mu\)deaths_high_gdp

We used a two sample t-test for each and used an \(\alpha\) value of 0.05.

Test 1:

Conclusion of test1: p-valuetest1 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_lowest_gdp is equal to \(\mu\)deaths_low_gdp and accept our alternative hypothesis.

Test 2:

Conclusion of test2: p-valuetest2 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_low_gdp is equal to \(\mu\)deaths_medium_gdp and accept our alternative hypothesis.

Test 3:

Conclusion of test3: p-valuetest3 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_medium_gdp is equal to \(\mu\)deaths_high_gdp and accept our alternative hypothesis.

Test 4:

Conclusion of test4: p-valuetest4 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_lowest_gdp is equal to \(\mu\)deaths_high_gdp and accept our alternative hypothesis.

Test 5:

Conclusion of test5: p-valuetest5 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_lowest_gdp is equal to \(\mu\)deaths_medium_gdp and accept our alternative hypothesis.

Test 6:

Conclusion of test6: p-valuetest6 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_low_gdp is equal to \(\mu\)deaths_high_gdp and accept our alternative hypothesis.

Midterm Main Research Results

From all of our tests above, we can confirm that the means of deaths caused by air pollution are statistically significant when grouped by different levels of GDP per capita. This reinforces the idea that deaths caused by air pollution has a significant relationship with GDP per capita and the model can be quantified by Equation 2:

\[ Deaths_{from~air~pollution|per~year|per~country} = 10^{7.38778 - 0.38952 * log(GDP per capita)} * 100,000 ~~~~~~~~~~~~~~~~ eqn (2) \]

The strength of the correlation can be quantified by our R2 value of 0.737 from Figure 16.

6. Smart Questions for Further Modeling

1. What are the impacts of population size from GDP and Deaths due to Air Pollution globally?

  1. Categorizing GDP into low (0), medium(1) and high(2)
SQ6 <- final_df
library(dplyr)
groupings_pop <- SQ6 %>% mutate(population_grouping = case_when(Population.thousands >=  100001 & Population.thousands <= 1000000000 ~ 2,
                                             Population.thousands >= 50001  & Population.thousands <= 100001 ~ 1,
                                             Population.thousands >= 0 & Population.thousands <= 50000 ~ 0)) # end function
  1. Building a logit model GDP as predictor on Population
pop_gdpLogit <- glm(population_grouping ~ groupings_pop$gdp.per.capita, data = groupings_pop)
summary(pop_gdpLogit)
## 
## Call:
## glm(formula = population_grouping ~ groupings_pop$gdp.per.capita, 
##     data = groupings_pop)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -0.246  -0.185  -0.181  -0.180   1.820  
## 
## Coefficients:
##                                 Estimate  Std. Error t value
## (Intercept)                  0.180177461 0.008492388   21.22
## groupings_pop$gdp.per.capita 0.000000553 0.000000456    1.21
##                                         Pr(>|t|)    
## (Intercept)                  <0.0000000000000002 ***
## groupings_pop$gdp.per.capita                0.23    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.267)
## 
##     Null deviance: 1389.8  on 5196  degrees of freedom
## Residual deviance: 1389.4  on 5195  degrees of freedom
## AIC: 7899
## 
## Number of Fisher Scoring iterations: 2
  1. Bulding a logit model GDP and Deaths due to Air Population as predictor on Population
pop_gdp_deathsLogit <- glm(population_grouping ~ groupings_pop$gdp.per.capita + groupings_pop$Deaths.Air.Pollution.per.100k, data = groupings_pop)
summary(pop_gdp_deathsLogit)
## 
## Call:
## glm(formula = population_grouping ~ groupings_pop$gdp.per.capita + 
##     groupings_pop$Deaths.Air.Pollution.per.100k, data = groupings_pop)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -0.215  -0.194  -0.186  -0.169   1.842  
## 
## Coefficients:
##                                                 Estimate   Std. Error t value
## (Intercept)                                  0.202910224  0.018188095   11.16
## groupings_pop$gdp.per.capita                 0.000000136  0.000000543    0.25
## groupings_pop$Deaths.Air.Pollution.per.100k -0.000213150  0.000150811   -1.41
##                                                        Pr(>|t|)    
## (Intercept)                                 <0.0000000000000002 ***
## groupings_pop$gdp.per.capita                               0.80    
## groupings_pop$Deaths.Air.Pollution.per.100k                0.16    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.267)
## 
##     Null deviance: 1389.8  on 5196  degrees of freedom
## Residual deviance: 1388.9  on 5194  degrees of freedom
## AIC: 7899
## 
## Number of Fisher Scoring iterations: 2

2. What are the effects of (low or high) GDP and population size on deaths due to air pollution in Sub-Saharan Africa?

Here we use decision tree classification algorithm to predict which countries in Sub-Saharan Africa are more likely to have lower or higher deaths per 100,000 due to air pollution. We use predictors -GDP per capita and population size in thousands.

As we learned this semester, a qualitative response warrants the utilization of a classification tree. This helps us predict you predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs.

Below, we subset our data to only relect Sub-Saharan countries, and then we create a new variable deaths that recategorizes existing variable Deaths.Air.Pollution.per.100k into "high" or "low" based on the median deaths due to air pollution from 1990 to 2017.

Since our median is 143. 5, we set deaths = “high” if Deaths.Air.Pollution.per.100k is greater than or equal to 143, and deaths = “low” if Deaths.Air.Pollution.per.100k is less than 143. Lastly, we transform the deaths variable into a factor variable

##  [1] "ISO.alpha2.code"               "M49.code"                     
##  [3] "Year"                          "ISO.alpha3.code"              
##  [5] "Country.x"                     "Deaths.Air.Pollution.per.100k"
##  [7] "GDP.USD"                       "SDGRegion"                    
##  [9] "SubRegion"                     "Population.thousands"         
## [11] "geom"                          "gdp.per.capita"
##     ISO.alpha2.code M49.code Year ISO.alpha3.code Country.x
## 157              AO       24 1997             AGO    Angola
## 158              AO       24 1990             AGO    Angola
## 159              AO       24 1992             AGO    Angola
## 160              AO       24 1998             AGO    Angola
## 161              AO       24 2001             AGO    Angola
## 162              AO       24 2014             AGO    Angola
##     Deaths.Air.Pollution.per.100k      GDP.USD         SDGRegion    SubRegion
## 157                   202.2717564   7648377413 SUB-SAHARANAFRICA MiddleAfrica
## 158                   224.5086673  11228764963 SUB-SAHARANAFRICA MiddleAfrica
## 159                   219.9268744   8307810974 SUB-SAHARANAFRICA MiddleAfrica
## 160                   202.7319031   6506229607 SUB-SAHARANAFRICA MiddleAfrica
## 161                   183.0152710   8936063723 SUB-SAHARANAFRICA MiddleAfrica
## 162                   103.2917847 145712200313 SUB-SAHARANAFRICA MiddleAfrica
##     Population.thousands                           geom gdp.per.capita
## 157                14872 MULTIPOLYGON (((12.32243167...    514.2803532
## 158                11848 MULTIPOLYGON (((12.32243167...    947.7350577
## 159                12657 MULTIPOLYGON (((12.32243167...    656.3807358
## 160                15360 MULTIPOLYGON (((12.32243167...    423.5826567
## 161                16946 MULTIPOLYGON (((12.32243167...    527.3258423
## 162                26942 MULTIPOLYGON (((12.32243167...   5408.3661314
## 'data.frame':    1239 obs. of  12 variables:
##  $ ISO.alpha2.code              : Factor w/ 248 levels "AD","AE","AF",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ M49.code                     : Factor w/ 249 levels "4","8","10","12",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ Year                         : Factor w/ 28 levels "1990","1991",..: 8 1 3 9 12 25 13 2 18 17 ...
##  $ ISO.alpha3.code              : Factor w/ 197 levels "","AFG","AGO",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Country.x                    : Factor w/ 231 levels "Afghanistan",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ Deaths.Air.Pollution.per.100k: num  202 225 220 203 183 ...
##  $ GDP.USD                      : num  7648377413 11228764963 8307810974 6506229607 8936063723 ...
##  $ SDGRegion                    : Factor w/ 9 levels "AUSTRALIA/NEWZEALAND",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ SubRegion                    : Factor w/ 22 levels "AUSTRALIA/NEWZEALAND",..: 10 10 10 10 10 10 10 10 10 10 ...
##  $ Population.thousands         : num  14872 11848 12657 15360 16946 ...
##  $ geom                         :sfc_MULTIPOLYGON of length 1239; first list element: List of 2
##   ..$ :List of 1
##   .. ..$ : num [1:66, 1:2] 12.3 12.2 12.7 12.9 13.2 ...
##   ..$ :List of 1
##   .. ..$ : num [1:9, 1:2] 13 12.6 12.3 11.9 12.2 ...
##   ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
##  $ gdp.per.capita               : num  514 948 656 424 527 ...
##  - attr(*, "na.action")= 'omit' Named int [1:56] 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 ...
##   ..- attr(*, "names")= chr [1:56] "5142" "5143" "5144" "5145" ...
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  42.3263 110.8553 143.4512 141.8158 170.7379 259.5389

Train the Model

The next thing we do is create a train and test set, this is done to train our model (train set) and test the prediction (test set). As taught in class, we split the data 80/20, 80% to train the model (991 observations), 20% to make predictions (248 observations). We verified this randomization process and see that low deaths in both sets are about 49% as shown below where we show the train and test breakdown of high vs low.

## 
##         high          low 
## 0.5045408678 0.4954591322
## 
##         high          low 
## 0.5040322581 0.4959677419

Next we build and train our model as deaths ~ gdp.per.capita + Population.thousands and plot the tree. It is important to note that the rpart() function we employ uses the Gini impurity measure to split the nodes. The Gini Impurity measures which features has least likelihood of misclassification.

## Call:
## rpart(formula = deaths ~ gdp.per.capita + Population.thousands, 
##     data = train, method = "class", control = list(maxdepth = 4))
##   n= 991 
## 
##              CP nsplit    rel error       xerror          xstd
## 1 0.47861507128      0 1.0000000000 1.0773930754 0.03198383265
## 2 0.04887983707      1 0.5213849287 0.5315682281 0.02823991365
## 3 0.01000000000      3 0.4236252546 0.4378818737 0.02642602072
## 
## Variable importance
##       gdp.per.capita Population.thousands 
##                   74                   26 
## 
## Node number 1: 991 observations,    complexity param=0.4786150713
##   predicted class=high  expected loss=0.4954591322  P(node) =1
##     class counts:   500   491
##    probabilities: 0.505 0.495 
##   left son=2 (672 obs) right son=3 (319 obs)
##   Primary splits:
##       gdp.per.capita       < 1064.866253 to the left,  improve=130.81631240, (0 missing)
##       Population.thousands < 1977.5      to the right, improve= 23.65467744, (0 missing)
##   Surrogate splits:
##       Population.thousands < 1300.5      to the right, agree=0.763, adj=0.263, (0 split)
## 
## Node number 2: 672 observations,    complexity param=0.04887983707
##   predicted class=high  expected loss=0.318452381  P(node) =0.6781029263
##     class counts:   458   214
##    probabilities: 0.682 0.318 
##   left son=4 (235 obs) right son=5 (437 obs)
##   Primary splits:
##       gdp.per.capita       < 347.5075869 to the left,  improve=42.27676140, (0 missing)
##       Population.thousands < 26171.5     to the left,  improve=20.09210181, (0 missing)
## 
## Node number 3: 319 observations
##   predicted class=low   expected loss=0.131661442  P(node) =0.3218970737
##     class counts:    42   277
##    probabilities: 0.132 0.868 
## 
## Node number 4: 235 observations
##   predicted class=high  expected loss=0.07659574468  P(node) =0.2371342079
##     class counts:   217    18
##    probabilities: 0.923 0.077 
## 
## Node number 5: 437 observations,    complexity param=0.04887983707
##   predicted class=high  expected loss=0.4485125858  P(node) =0.4409687185
##     class counts:   241   196
##    probabilities: 0.551 0.449 
##   left son=10 (371 obs) right son=11 (66 obs)
##   Primary splits:
##       Population.thousands < 25928.5     to the left,  improve=26.793946050, (0 missing)
##       gdp.per.capita       < 536.6554273 to the left,  improve= 2.834010564, (0 missing)
## 
## Node number 10: 371 observations
##   predicted class=high  expected loss=0.3746630728  P(node) =0.3743693239
##     class counts:   232   139
##    probabilities: 0.625 0.375 
## 
## Node number 11: 66 observations
##   predicted class=low   expected loss=0.1363636364  P(node) =0.06659939455
##     class counts:     9    57
##    probabilities: 0.136 0.864
## 
## Classification tree:
## rpart(formula = deaths ~ gdp.per.capita + Population.thousands, 
##     data = train, method = "class", control = list(maxdepth = 4))
## 
## Variables actually used in tree construction:
## [1] gdp.per.capita       Population.thousands
## 
## Root node error: 491/991 = 0.49545913
## 
## n= 991 
## 
##            CP nsplit  rel error     xerror        xstd
## 1 0.478615071      0 1.00000000 1.07739308 0.031983833
## 2 0.048879837      1 0.52138493 0.53156823 0.028239914
## 3 0.010000000      3 0.42362525 0.43788187 0.026426021

As shown above, our decision tree shows at the root node that 50% of countries in Sub-Saharan Africa have high deaths due to air pollution. Then, we see that 68% of these countries have GDP per capita that is less than $1,065 with probability of high deaths due to air pollution (32%). Then we see that out of these countries, 44% has gdp per capita NOT greater than $348 with probability of high deaths due to air pollution- 24% . Lastly, we see that 37% of the countries have population less than about 26,000,000 and GDP per capita greater than $348 and less than $1,065 - with probability of high deaths due to air pollution at 37%.

#Tree Plot
post(afit, file = "afit.ps", title = "Classification Tree for Deaths due to Air Pollution in Sub-Sahara Africa")

Predict

Here, we use the above test tree to predict it on our test set. We are predicting which countries are more likely to have higher deaths due to air pollution.

## high  low 
##  159   89

Evaluation

The next thing we do is evaluate our model’s performance using the confusion matrix, and from that, the accuracy test.

As we all know, TP: (True Positive) means predicted values correctly predicted as positive FP: (False Positive) means predicted values incorrectly predicted an actual positive FN: (False Negative) means positive values predicted as negative TN: (True Negative) means predicted values correctly predicted as an actual negative

#Evaluate

# Confusion matrix and Accuracy test
cm<-with(test, table(apred, deaths))
print(cm)
##       deaths
## apred  high low
##   high  110  49
##   low    15  74
xkabledply(cm, "confusion matrix")
confusion matrix
high low
high 110 49
low 15 74
accuracy <- sum(diag(cm)) / sum(cm)
print(paste('Accuracy for test', accuracy))
## [1] "Accuracy for test 0.741935483870968"

So we see above that our model accurately predicted 110 countries with high deaths due to air pollution (True negative) and 74 countries with low deaths due to air pollution (True positive), however it misclassified 49 countries for the former (False positive) and 15 countries for the latter(True negative).

The overall accuracy is 74%. Here the accuracy-test from the confusion matrix is calculated and is found to be 0.741935 Hence this model is found to predict with an accuracy of 74%.

Cross-Entropy: A third alternative, which is similar to the Gini Index, is known as the Cross-Entropy or Deviance:

Prune the tree

The last thing we want to do here is prune our tree, this will minimize overfitting the data by minimizing the cross-validated error.

#prune the tree 
pafit <- prune(afit, cp = afit$cptable[2,"CP"])
printcp(pafit)
## 
## Classification tree:
## rpart(formula = deaths ~ gdp.per.capita + Population.thousands, 
##     data = train, method = "class", control = list(maxdepth = 4))
## 
## Variables actually used in tree construction:
## [1] gdp.per.capita
## 
## Root node error: 491/991 = 0.49545913
## 
## n= 991 
## 
##            CP nsplit  rel error     xerror        xstd
## 1 0.478615071      0 1.00000000 1.07739308 0.031983833
## 2 0.048879837      1 0.52138493 0.53156823 0.028239914
#cp is cost complexity and has to be minimum.

# plot the pruned tree 
fancyRpartPlot(pafit,main="Pruned Classification Tree for Deaths due to Air Polution")

## high  low 
##  174   74
## [1] "Accuracy for test 0.741935483870968"

As shown above, accuracy is now 74%. Pruning had little effect even though the tree is now simpler. So, we have a model that can predict high deaths due to air pollution with 74% accuracy.

# create attractive postcript plot of tree 
post(afit, file = "afit.ps", title = "Pruned Classification Tree for Deaths due to Air Pollution in Sub-Sahara Africa")

How can we improve this model in the future?

Instead of of an individual tree, we can employ other techniques such as bagging, random forests, and boosting, which will significantly improve our predictive performance!

3. Can we let the data tell us what type of groupings exist in our dataset? How consistent are they to our preconceived groupings such as region or developed vs developing countries?

We wanted to better understand what types of groupings exist within our dataset. More concretely, we wanted to put aside our preconceived presumptions about our dataset and create a K-Means clustering model and allow the the data to guide us to the possible clusters that may exist within our data.

K-Means clustering is an unsupervised machine learning algorithm that is very useful to parse the dataset and identify K number of groups where observations within each group have high similarity, all without the need of labeled data.

To begin our process, let’s first create an index label to allow mapping each datapoint back to a country after clustering.

Figure 6.3-0: Head of final_df_cluster
index_ region Deaths.Air.Pollution.per.100k gdp.per.capita Population.thousands
Australia Australia AUSTRALIA/NEWZEALAND 17.76814718 35411.5137067 20282.892857
New Zealand New Zealand AUSTRALIA/NEWZEALAND 15.92536379 25278.8349742 4056.678571
Afghanistan Afghanistan CENTRALANDSOUTHERNASIA 227.76525452 430.2877879 29282.000000
Bangladesh Bangladesh CENTRALANDSOUTHERNASIA 144.51373473 619.5903166 133735.607143
Bhutan Bhutan CENTRALANDSOUTHERNASIA 115.89647264 1408.2427343 627.000000
India India CENTRALANDSOUTHERNASIA 165.91366818 838.3205492 1115445.428571

Next, let’s select only the numerical values from our dataset to use in clustering.

Figure 6.3-1: Structure of final_df_numerical
variable class first_values
Deaths.Air.Pollution.per.100k double 17.768147178761, 15.9253637864585, 227.765254524006, 144.513734731741, 115.896472640701, 165.913668178177
gdp.per.capita double 35411.5137067032, 25278.8349742491, 430.287787933406, 619.590316576761, 1408.24273427377, 838.320549169855
Population.thousands double 20282.8928571429, 4056.67857142857, 29282, 133735.607142857, 627, 1115445.42857143

Optimal K Clusters

The key hyperparameter that we will need to optimize for is the optimal K value to achieve best results of low model complexity but still capturing relevant clusters.

The 2 methods to find optimal K is the following:

  1. Elbow Method with Total Within Sum of Squares
  2. Looking at the Gap Statistics

We will try both. Let’s begin with the first method.

Elbow Method with Total Within Sum of Squares

Figure 6.3-2: Finding the optimal number of clusters using the Within Sum of Squares Method.

From the total Within Sum of Squares plot, we find that the optimal number of K seems to be at 4 using the Elbow method which identifies the K number where the Total Sum of Squares begins to level off, or where adding additional K clusters only improves our model marginally.

Looking at the Gap Statistics

Figure 6.3-3: Finding the optimal number of clusters using the Gap Statistic.

From the Gap Statistic plot, we find that the optimal number of K seems to be at 10 with the highest gap statistic, however, we observed using the Elbow method that the Total Sum of Squares begins to level off from 4 to 10. We will stick with using K = 4.

Perform K-Means Clustering with Optimal K

We perform a K-Means clustering analysis with a K of 4 and here are the results below.

Figure 6.3-4: K-Means Clustering Average Values per Cluster
Cluster Deaths.Air.Pollution.per.100k gdp.per.capita Population.thousands
1 1.16234711051454 -0.61738173258503 -0.101796957671015
2 1.01091758337588 -0.564333841794921 9.25163623841697
3 -1.04092828247631 1.92017020504228 -0.0578237235004342
4 -0.494528110744794 -0.253250480051675 -0.107965219042902
Figure 6.3-5: Number of Observations Within Each Cluster
Cluster Observations with Each Cluster
1 67
2 2
3 34
4 90
Figure 6.3-6: Total Sum of Squares Within Each Cluster
Cluster Total Sum of Squares with Each Cluster
1 30.8781
2 1.5231
3 34.8756
4 26.9591

Let’s take a look at clusters visually. Interestingly, our model has grouped China and India into their own separate cluster, Cluster 2. Clusters 4, 3, and 1 can be roughly characterized by high GDP per captita nations, medium GDP per capita nations, and low GDP per capita nations’.

Figure 3.6-7: Visualing K-Means Clustering Model.

Let’s take a closer look at each cluster’s breakdown of subregions more closely to determine what type of discrepancies we find between each cluster.

Figure 6.3-8: Frequency % of SDGRegion by Clusters.

Interestingly, we observe that Cluster 1 is disproportionately comprised of countries in Sub-Saharan Africa with large representations from Oceania, Eastern & South-Eastern Asia, and Central & Southern Asia. Cluster 2 is the one cluster comprised solely of China and India. Cluster 3 has a much larger proportion of countries from Europe, Northen America, and Northern Africa & Western Asia. Cluster 4 is similar to Cluster 3 except it does not have countries from Northern America and has more countries from Sub-Saharan Africa and Central & Southern Asia.

4. Can we make a prediction of the future GDP by considering the population size, location, and the reation of population and deaths due to air pollution?

For our major model we make a nice prediction for GDP through total deaths caused by air pollution, so as we can see that through the distribution of the data, there are strong relationships between total deaths caused by air pollution, deaths caused by indoor air pollution, and deaths caused by outdoor air pollution.

So it seems like we can also get a nice GDP prediction by using deaths caused by indoor air pollution and deaths caused by outdoor air pollution as the basis of the prediction model.

For the first model, we build a linear model using ‘gdp.per.capita’ and ‘Deaths.Air.Pollution.Indoor.per.100k’.

## 
## Call:
## lm(formula = gdp.per.capita ~ Deaths.Air.Pollution.Indoor.per.100k, 
##     data = final_df_sfc)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -15508268  -8789754  -3652222   3500260 102785447 
## 
## Coefficients:
##                                          Estimate   Std. Error   t value
## (Intercept)                          16339919.456   254680.707  64.15845
## Deaths.Air.Pollution.Indoor.per.100k  -126646.729     3307.569 -38.28997
##                                                    Pr(>|t|)    
## (Intercept)                          < 0.000000000000000222 ***
## Deaths.Air.Pollution.Indoor.per.100k < 0.000000000000000222 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13887560 on 5195 degrees of freedom
## Multiple R-squared:  0.2201014,  Adjusted R-squared:  0.2199512 
## F-statistic: 1466.122 on 1 and 5195 DF,  p-value: < 0.00000000000000022204
Figure 6.4-1: Deaths due to Indoor Air Pollution vs GDP per capita.

The R-squared value of this model is 0.22, which is a really bad result. So follow the steps in the major model building, now we put these two features into a log() and it should performed better.

## 
## Call:
## lm(formula = log(gdp.per.capita) ~ log(Deaths.Air.Pollution.Indoor.per.100k), 
##     data = final_df_sfc)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.3535187 -0.5992551  0.0980527  0.6472414  2.8921185 
## 
## Coefficients:
##                                               Estimate   Std. Error   t value
## (Intercept)                               16.332565271  0.017626039  926.6157
## log(Deaths.Air.Pollution.Indoor.per.100k) -0.547607424  0.005167888 -105.9635
##                                                         Pr(>|t|)    
## (Intercept)                               < 0.000000000000000222 ***
## log(Deaths.Air.Pollution.Indoor.per.100k) < 0.000000000000000222 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8906039 on 5195 degrees of freedom
## Multiple R-squared:  0.6836804,  Adjusted R-squared:  0.6836195 
## F-statistic: 11228.26 on 1 and 5195 DF,  p-value: < 0.00000000000000022204
Figure 6.4-2: Log Deaths due to Indoor Air Pollution vs Log GDP per capita.

As we can see, although it performed much better with a R-squared value 0.68 but this is still not good as the major model.

Next we try the deaths caused by outdoor air pollution ‘Deaths.Air.Pollution.Outdoor.Total.per.100k’.

## 
## Call:
## lm(formula = gdp.per.capita ~ Deaths.Air.Pollution.Outdoor.Total.per.100k, 
##     data = final_df_sfc)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -14051236  -9537840  -5335898   2518144 106063637 
## 
## Coefficients:
##                                                Estimate  Std. Error   t value
## (Intercept)                                 15643215.32   461879.67  33.86859
## Deaths.Air.Pollution.Outdoor.Total.per.100k  -150382.23    10830.39 -13.88521
##                                                           Pr(>|t|)    
## (Intercept)                                 < 0.000000000000000222 ***
## Deaths.Air.Pollution.Outdoor.Total.per.100k < 0.000000000000000222 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15441650 on 5195 degrees of freedom
## Multiple R-squared:  0.0357844,  Adjusted R-squared:  0.03559879 
## F-statistic: 192.7991 on 1 and 5195 DF,  p-value: < 0.00000000000000022204
Figure 6.4-3: Deaths due to Outdoor Air Pollution vs GDP per capita.

The performence of this model is also bad, with a R-squared value 0.036, so we put the features in the log() too.

## 
## Call:
## lm(formula = log(gdp.per.capita) ~ log(Deaths.Air.Pollution.Outdoor.Total.per.100k), 
##     data = final_df_sfc)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.5907203 -1.2717139 -0.0273497  1.2970215  3.4713403 
## 
## Coefficients:
##                                                     Estimate  Std. Error
## (Intercept)                                      15.69882895  0.15654605
## log(Deaths.Air.Pollution.Outdoor.Total.per.100k) -0.19907742  0.04418098
##                                                    t value
## (Intercept)                                      100.28250
## log(Deaths.Air.Pollution.Outdoor.Total.per.100k)  -4.50595
##                                                                Pr(>|t|)    
## (Intercept)                                      < 0.000000000000000222 ***
## log(Deaths.Air.Pollution.Outdoor.Total.per.100k)           0.0000067525 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.580427 on 5195 degrees of freedom
## Multiple R-squared:  0.003893084,    Adjusted R-squared:  0.00370134 
## F-statistic: 20.30361 on 1 and 5195 DF,  p-value: 0.000006752534
Figure 6.4-4: Log Deaths due to Outdoor Air Pollution vs Log GDP per capita.

A surprised point is that although we put the variables into log(), the performance of model is still bad and even worse. The R-squared value of this model is 0.0039.

Finally, we are going to test build a model base on both deaths caused by outdoor and indoor air pollution. To check if this model also works bad.

## 
## Call:
## lm(formula = gdp.per.capita ~ Deaths.Air.Pollution.Indoor.per.100k + 
##     Deaths.Air.Pollution.Outdoor.Total.per.100k, data = final_df_sfc)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -18198746  -7911990  -2886359   3411956  96919910 
## 
## Coefficients:
##                                                 Estimate   Std. Error   t value
## (Intercept)                                 26403084.065   457576.540  57.70201
## Deaths.Air.Pollution.Indoor.per.100k         -144488.737     3189.786 -45.29732
## Deaths.Air.Pollution.Outdoor.Total.per.100k  -242554.830     9393.524 -25.82149
##                                                           Pr(>|t|)    
## (Intercept)                                 < 0.000000000000000222 ***
## Deaths.Air.Pollution.Indoor.per.100k        < 0.000000000000000222 ***
## Deaths.Air.Pollution.Outdoor.Total.per.100k < 0.000000000000000222 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13075010 on 5194 degrees of freedom
## Multiple R-squared:  0.3088267,  Adjusted R-squared:  0.3085606 
## F-statistic: 1160.379 on 2 and 5194 DF,  p-value: < 0.00000000000000022204
Figure 6.4-5: Deaths due to Indoor + Outdoor Air Pollution vs GDP per capita.

The R-squared value of this model is 0.29. Next we put these variables into log().

## 
## Call:
## lm(formula = log(gdp.per.capita) ~ log(Deaths.Air.Pollution.Outdoor.Total.per.100k) + 
##     log(Deaths.Air.Pollution.Indoor.per.100k), data = final_df_sfc)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.1996055 -0.5942513  0.0554719  0.6207513  2.9007194 
## 
## Coefficients:
##                                                      Estimate   Std. Error
## (Intercept)                                      17.606051113  0.088198404
## log(Deaths.Air.Pollution.Outdoor.Total.per.100k) -0.359853980  0.024440072
## log(Deaths.Air.Pollution.Indoor.per.100k)        -0.552122333  0.005073061
##                                                     t value
## (Intercept)                                       199.61870
## log(Deaths.Air.Pollution.Outdoor.Total.per.100k)  -14.72393
## log(Deaths.Air.Pollution.Indoor.per.100k)        -108.83415
##                                                                Pr(>|t|)    
## (Intercept)                                      < 0.000000000000000222 ***
## log(Deaths.Air.Pollution.Outdoor.Total.per.100k) < 0.000000000000000222 ***
## log(Deaths.Air.Pollution.Indoor.per.100k)        < 0.000000000000000222 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8726636 on 5194 degrees of freedom
## Multiple R-squared:  0.6963544,  Adjusted R-squared:  0.6962374 
## F-statistic: 5955.733 on 2 and 5194 DF,  p-value: < 0.00000000000000022204
Figure 6.4-6: Log Deaths due to Indoor + Outdoor Air Pollution vs Log GDP per capita.

For this model the R-squared value is 0.69. This is actually not a bad result but when comparing to the major model using total deaths caused by air pollution, 0.69 is not good enough for predicting GDP.

We can get the conclusion that we can not get a nice GDP prediction by using deaths caused by indoor air pollution and deaths caused by outdoor air pollution as the basis of the prediction model.

7. Conclusion of SMART Modeling Questions

  1. For the Logit Regression, we observed small p-values indicating that all the coefficients are found to be significant. GDP has a positive effect on population, but deaths due to air pollution have a negative effect on population.

  2. As shown above, post-pruning, our accuracy is still 74%. So, pruning had little effect even though the tree is now much simpler. Therefore, we have a model that can predict high deaths due to air pollution with 74% accuracy.

  3. For the clustering, it is really interesting to observe that the main clusters 1, 3, and 4 have large discrepancies in regional representation while the data we used for clustering did not have any explicit geospatial components. This can mean that regionality, although may not be the cause, does have some correlation in determining the groups of factors in GDP per Capita, Population, and Deaths due to Air Pollution per 100k.

  4. Although death caused by indoor air pollution and outdoor air pollution seems to have a strong relationship with total death caused by air pollution, but we can’t accurately predict GDP through death caused by indoor and outdoor air pollution.

8. Bibliography

Figure 8.1: References
Number APA Citation
1 Robin Lovelace, J. N. (n.d.). Chapter 8 Making maps with R: Geocomputation with R. Retrieved October 28, 2021, from https://geocompr.robinlovelace.net/adv-map.html
2 Robin Lovelace, J. N. (2021, October 28). Chapter 2 Geographic data in R: Geocomputation with R. Retrieved from https://geocompr.robinlovelace.net/spatial-class.html#intro-sf
3 Hadley Wickham, D. N. (2021, October 28). 6 Maps. Retrieved from https://ggplot2-book.org/maps.html
4 Customizing ggplot2 color and fill scales. (2021, October 28). Retrieved from https://spielmanlab.github.io/introverse/articles/color_fill_scales.html
5 Logarithmic Functions. (2021, October 28). Retrieved from https://saylordotorg.github.io/text_intermediate-algebra/s10-03-logarithmic-functions-and-thei.html
6 K-Means Clustering in R. (2021, December 3). Retrieved from https://www.statology.org/k-means-clustering-in-r/
7 Confusion Matrix in Machine Learning. (2021, December 3). Retrieved from https://www.guru99.com/confusion-matrix-machine-learning-example.html